Scalable Probabilistic Entity-Topic Modeling
Authors
Abstract
We present an LDA-based approach to entity disambiguation. Each topic is associated with a Wikipedia article, and topics generate either content words or entity mentions. Training such models is challenging because both the number of topics and the vocabulary size are in the millions. We tackle these problems using a novel distributed inference and representation framework based on a parallel Gibbs sampler guided by the Wikipedia link graph, together with MapReduce pipelines that allow fast and memory-frugal processing of large datasets. We report state-of-the-art performance on a public dataset.
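For concreteness, the sketch below shows one collapsed Gibbs sweep for an entity-topic model of this kind, where each mention is only resampled over a small candidate set of entities (standing in for the link-graph guidance the abstract describes). The corpus, sizes, and candidate-generation step are illustrative assumptions, not the paper's implementation.

import numpy as np

# Minimal sketch of a collapsed Gibbs sweep for an entity-topic model,
# assuming each topic corresponds to one Wikipedia entity and each token
# only considers a small candidate set (hypothetical stand-in for the
# link-graph guidance). Toy sizes; the paper works with millions.
rng = np.random.default_rng(0)
V, K = 50, 10
alpha, beta = 0.1, 0.01

# toy corpus: documents of (token_id, candidate_entity_ids) pairs
docs = [[(int(rng.integers(V)), rng.choice(K, size=3, replace=False))
         for _ in range(20)] for _ in range(5)]

# count tables maintained by the collapsed sampler
n_dk = np.zeros((len(docs), K))      # topic counts per document
n_kw = np.zeros((K, V))              # word counts per topic
n_k = np.zeros(K)                    # total words per topic
z = [[0] * len(d) for d in docs]     # current assignment per token

# random initialisation restricted to each token's candidate set
for d, doc in enumerate(docs):
    for i, (w, cands) in enumerate(doc):
        k = int(rng.choice(cands))
        z[d][i] = k
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

def gibbs_sweep():
    for d, doc in enumerate(docs):
        for i, (w, cands) in enumerate(doc):
            k_old = z[d][i]
            n_dk[d, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
            # conditional over the candidate set only: restricting the
            # support is what keeps sampling tractable at large K
            p = (n_dk[d, cands] + alpha) * (n_kw[cands, w] + beta) \
                / (n_k[cands] + V * beta)
            k_new = int(rng.choice(cands, p=p / p.sum()))
            z[d][i] = k_new
            n_dk[d, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1

for _ in range(50):
    gibbs_sweep()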
Similar resources
A Scalable Gibbs Sampler for Probabilistic Entity Linking
Entity linking involves labeling phrases in text with their referent entities, such as Wikipedia or Freebase entries. This task is challenging due to the large number of possible entities, in the millions, and heavy-tailed mention ambiguity. We formulate the problem in terms of probabilistic inference within a topic model, where each topic is associated with a Wikipedia article. To deal with th...
Anchors Regularized: Adding Robustness and Extensibility to Scalable Topic-Modeling Algorithms
Spectral methods offer scalable alternatives to Markov chain Monte Carlo and expectation maximization. However, these new methods lack the rich priors associated with probabilistic models. We examine Arora et al.’s anchor words algorithm for topic modeling and develop new, regularized algorithms that not only mathematically resemble Gaussian and Dirichlet priors but also improve the interpretab...
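The sketch below illustrates the recovery step of an anchor-words algorithm of the kind Arora et al. describe, assuming the anchor word indices are already chosen; plain non-negative least squares stands in for the exponentiated-gradient solver typically used, so treat it as a simplification rather than the authors' method.

import numpy as np
from scipy.optimize import nnls

def recover_topics(Q, anchors):
    """Q: V x V joint word co-occurrence matrix; anchors: K anchor word ids.
    Returns A with A[w, k] approximating p(word = w | topic = k)."""
    p_w = Q.sum(axis=1)                           # marginal word probabilities
    Q_bar = Q / np.maximum(p_w[:, None], 1e-12)   # row-normalised co-occurrence
    V, K = Q.shape[0], len(anchors)
    B = Q_bar[anchors].T                          # columns are anchor rows
    C = np.zeros((V, K))
    for w in range(V):
        c, _ = nnls(B, Q_bar[w])                  # non-negative mixing weights
        C[w] = c / max(c.sum(), 1e-12)            # renormalise onto the simplex
    A = C * p_w[:, None]                          # Bayes: p(k|w) p(w) = p(k, w)
    return A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)

# toy usage with a random symmetric co-occurrence and two assumed anchors
rng = np.random.default_rng(0)
M = rng.random((6, 6)); Q = M + M.T; Q /= Q.sum()
print(recover_topics(Q, anchors=[0, 3]).round(3))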
Grounding Topic Models with Knowledge Bases
Topic models represent latent topics as probability distributions over words which can be hard to interpret due to the lack of grounded semantics. In this paper, we propose a structured topic representation based on an entity taxonomy from a knowledge base. A probabilistic model is developed to infer both hidden topics and entities from text corpora. Each topic is equipped with a random walk ov...
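One plausible reading of "equipping each topic with a random walk over the taxonomy" is a personalised random walk that propagates a topic's entity weights along taxonomy edges; the sketch below implements that interpretation. The graph, restart probability, and seed weights are illustrative assumptions, not the paper's model.

import numpy as np

def topic_entity_distribution(adj, seed_probs, restart=0.15, iters=50):
    """Random walk with restart: spread a topic's seed entity weights
    through the taxonomy graph given by adjacency matrix `adj`."""
    # column-stochastic transition matrix over taxonomy edges
    T = adj / np.maximum(adj.sum(axis=0, keepdims=True), 1e-12)
    p = seed_probs.copy()
    for _ in range(iters):
        p = restart * seed_probs + (1 - restart) * T @ p
    return p

# toy taxonomy over 4 entities (undirected edges for simplicity)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 0],
                [0, 1, 0, 0]], float)
seed = np.array([0.7, 0.3, 0.0, 0.0])  # topic concentrated on two entities
print(topic_entity_distribution(adj, seed).round(3))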
Mixed Membership Word Embeddings: Corpus-Specific Embeddings Without Big Data
Word embeddings provide a nuanced representation of words which can improve the performance of NLP systems by revealing the hidden structural properties of words and their relationships to each other. These models have recently risen in popularity due to the successful performance of scalable algorithms trained in the big data setting. Consequently, word embeddings are commonly trained on very ...
Mixed Membership Word Embeddings for Computational Social Science
Word embeddings improve the performance of NLP systems by revealing the hidden structural relationships between words. These models have recently risen in popularity due to the performance of scalable algorithms trained in the big data setting. Despite their success, word embeddings have seen very little use in computational social science NLP tasks, presumably due to their reliance on big data...
Journal: CoRR
Volume: abs/1309.0337
Issue: -
Pages: -
Year: 2013